Skip to content

CI: auto-trigger AITER prebuilt upload when 3rdparty/aiter updates on dev#543

Open
VeeraRajasekhar wants to merge 7 commits into
devfrom
veergopu/aiter-prebuilt-upload-on-dev
Open

CI: auto-trigger AITER prebuilt upload when 3rdparty/aiter updates on dev#543
VeeraRajasekhar wants to merge 7 commits into
devfrom
veergopu/aiter-prebuilt-upload-on-dev

Conversation

@VeeraRajasekhar
Copy link
Copy Markdown
Contributor

Description

This adds CI to build and upload AITER prebuilts to AMD Artifactory when the AITER submodule pointer or contents under 3rdparty/aiter change on dev, so prebuilts stay aligned with the submodule without relying only on manual runs.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Add .github/workflows/aiter-prebuilt-upload.yml: push to dev with paths: 3rdparty/aiter.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@VeeraRajasekhar VeeraRajasekhar force-pushed the veergopu/aiter-prebuilt-upload-on-dev branch from da019d7 to 5240d8d Compare April 15, 2026 20:37
@VeeraRajasekhar VeeraRajasekhar marked this pull request as ready for review April 15, 2026 20:37
@VeeraRajasekhar VeeraRajasekhar self-assigned this Apr 15, 2026
@Micky774
Copy link
Copy Markdown
Contributor

What happens if e.g. an aiter prebuilt cache entry was already created from a branch before it gets merged into dev, thus on merge this action would not be needed -- does it still run and overwrite the entry? Does it cancel gracefully?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushing to dev is too late. Binaries are expected to be cached when PR is created, otherwise the PR CI will have to rebuild them

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially started with an idea to provide a PR comment trigger which does this. Do you think it is better? In this case I might need to provide a way to force-push if the user needs to trigger multiple times while working on the PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it can be driven by CI labels + filter by specific path modification. I.e. on the first CI run after aiter commit update, it first builds and uploads AITER and then goes further with CI.

Copy link
Copy Markdown
Contributor

@Micky774 Micky774 Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On that note, can we have an upload as a side effect during the build-and-test workflow? That would provide a relatively simple way to implement this.

Specifically, we can do this more-or-less as-is by using the following filtering for a prebuilt cache upload:

  on:                                                                                                                                       
    pull_request: 
      paths:
        - '3rdparty/aiter'
        - '3rdpart/aiter/***'

Not sure which one exactly is needed since aiter is a submodule but one should work. Still, I'd be more interested in conditionally checking whether the AITER submodule was built from source, and then uploading if it was in the build-and-test flow.

@VeeraRajasekhar
Copy link
Copy Markdown
Contributor Author

What happens if e.g. an aiter prebuilt cache entry was already created from a branch before it gets merged into dev, thus on merge this action would not be needed -- does it still run and overwrite the entry? Does it cancel gracefully?

It will cancel It won't force push or anything.

@VeeraRajasekhar
Copy link
Copy Markdown
Contributor Author

rocm-ci-dispatch.yml

  • Job aiter_upload_trigger: paths-filter on 3rdparty/aiter vs the PR base → sets trigger_aiter_upload.
  • dispatch calls rocm-ci.yml with test_level + trigger_aiter_upload (same entry path as existing PR CI).

rocm-ci.yml

  • select_docker_image: resolves the Docker image once from ci/ci_config.json (shared script: select_te_docker_image_ci_config.sh).
  • upload_aiter_prebuilt: runs only when workflow_call / workflow_dispatch and inputs.trigger_aiter_upload — i.e. PR CI (or manual), not on push to dev/release. It uses aiter-prebuilt-upload.yml and passes the resolved image.
  • build_and_test: GPU matrix; waits for image resolution and for upload to succeed or be skipped (skipped on post-merge push, where we assume prebuilts were already published from the PR).

aiter-prebuilt-upload.yml

  • Reusable workflow: workflow_call takes required docker_image from rocm-ci; workflow_dispatch unchanged for manual runs.

@ipanfilo
Copy link
Copy Markdown
Collaborator

Big rocm-ci refactor is in progress #528, let's postpone this PR till refactoring one is merged

Copy link
Copy Markdown
Collaborator

@ipanfilo ipanfilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the PR on hold not to interfere with other WIP

Comment thread .github/workflows/rocm-ci.yml Outdated
type: boolean
default: false
trigger_aiter_upload:
description: 'Advanced; PR path uses rocm-ci-dispatch. Default false.'
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

workflow_dispatch is not PR triggered action. Also no need to specify default in description

@VeeraRajasekhar
Copy link
Copy Markdown
Contributor Author

rebasing with the latest changes to rocm-ci file.

@VeeraRajasekhar VeeraRajasekhar force-pushed the veergopu/aiter-prebuilt-upload-on-dev branch from a4fd1df to 2a63432 Compare April 30, 2026 22:34
@VeeraRajasekhar VeeraRajasekhar force-pushed the veergopu/aiter-prebuilt-upload-on-dev branch from 109e32e to 52be5b1 Compare April 30, 2026 23:27
@Micky774
Copy link
Copy Markdown
Contributor

Micky774 commented May 1, 2026

Rather than adding a label to skip the pre-built upload, can we instead have it be opt-in so that we don't burn extra resources on intermediate changes and iteration?

@VeeraRajasekhar VeeraRajasekhar added the ci-level 1 CI test level 1 label May 1, 2026
@VeeraRajasekhar
Copy link
Copy Markdown
Contributor Author

Rather than adding a label to skip the pre-built upload, can we instead have it be opt-in so that we don't burn extra resources on intermediate changes and iteration?

I will be removing this “skip prebuilt upload” label. The original idea was to handle cases where Artifactory was unreachable from the runner, but that’s no longer the main control: we now have preflight checks that test whether the AITER prebuilt Artifactory is reachable from the current runner. If it isn’t, we skip the upload and still run CI. So we don’t need a separate skip label for that anymore.

As for keeping an Aiter-Upload label, doing this would force an awkward ordering (“only add ci-level-* after upload succeeds”) and extra process overhead. It would also waste effort when people set test labels first and we build but never publish.

Current behavior
Labels stay as they are. Dispatch can wait until the AITER upload step finishes (success or skipped after preflight). Preflight determines reachability: if Artifactory isn’t reachable, we skip upload and continue with tests instead of failing or burning upload work unnecessarily.

@VeeraRajasekhar VeeraRajasekhar removed the ci-level 1 CI test level 1 label May 2, 2026
Wire PR Automatic CI to optionally build and upload AITER prebuilts to Artifactory
when a synchronize touches 3rdparty/aiter and the upload-success cache misses.
Reuse select-docker-image for consistent image resolution across dispatch, rocm-ci,
and the upload workflow.

- rocm-ci-dispatch: gate on aiter paths, compute aiter_upload_cache_key from Docker
  image slug and submodule SHA, restore cache before deciding trigger_aiter_upload,
  pass docker_image_override and flags into rocm-ci.
- rocm-ci: call aiter-prebuilt-upload on pull_request when trigger_aiter_upload;
  pass image tag and cache key; keep build gated on select_image and upload success
  or skip.
- aiter-prebuilt-upload: always run select_docker_image; pull/run using
  needs.select_docker_image.outputs.image_tag; optional cache save after success.
- Add select-docker-image reusable workflow for ci/ci_config.json resolution.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

Claude Walkthrough

Intent. Auto-build and upload the AITER prebuilt to AMD Artifactory whenever a PR changes 3rdparty/aiter so cached prebuilts stay in sync with the submodule without relying solely on manual workflow_dispatch runs. Adds reachability preflights so misconfigured tokens or network paths fail loudly before doing real work, and factors out a shared docker-image resolver so the dispatcher and the upload job pick the same image.

Key changes.

  • New reusable Select Docker Image workflow at .github/workflows/select-docker-image.yml — extracted from inline logic previously in rocm-ci.yml; output renamed image-tagimage_tag.
  • aiter-prebuilt-upload.yml becomes reusable (workflow_call) and consumes select-docker-image.yml; renames input docker_imagedocker_image_override; adds optional aiter_upload_cache_key to record a per-(aiter SHA, image) success marker via actions/cache/save (.github/workflows/aiter-prebuilt-upload.yml:14, :78, :96).
  • rocm-ci-dispatch.yml gains an aiter_gate (paths-filter on 3rdparty/aiter/**) and an aiter_prebuilt_upload_trigger job that computes cache_key=aiter-prebuilt-upload-ok-${IMAGE_SLUG}-${AITER_SHA} and skips the upload on a cache hit (.github/workflows/rocm-ci-dispatch.yml:34, :55, :74).
  • rocm-ci.yml adds an upload_aiter_prebuilt job that calls the reusable upload workflow when trigger_aiter_upload is true, and gates build on it succeeding or being skipped (.github/workflows/rocm-ci.yml:74, :85).
  • aiter_prebuild_upload.sh grows two --preflight modes (--upload, --download) that ping Artifactory and HEAD a probe URL — authenticated for upload, anonymous for download — accepting 200 or 404 as success (.github/scripts/aiter_prebuild_upload.sh:11, :19, :46, :63).
  • rocm-wheels-build.yml reads NVTE_AITER_PREBUILT_BASE_URL from vars instead of hardcoding the Artifactory URL, and runs the anonymous download preflight (non-blocking via continue-on-error) when use_prebuilt_aiter is set (.github/workflows/rocm-wheels-build.yml:79, :94).

Walkthrough.

select-docker-image.yml is a verbatim lift of the previous select_image job in rocm-ci.yml: sparse-checks out ci/ci_config.json (from caller ref when test_config_from_source, otherwise default branch), picks the entry keyed by github.base_ref || github.ref_name with a default fallback, and lets docker_image_override win. It now lives once and is reused by rocm-ci.yml, rocm-ci-dispatch.yml, and aiter-prebuilt-upload.yml, so all three resolve to the same image for a given PR run.

rocm-ci-dispatch.yml adds the trigger logic. aiter_gate only runs on synchronize and uses dorny/paths-filter@v4 to detect 3rdparty/aiter/** changes. If matched, aiter_prebuilt_upload_trigger checks out the PR head, derives AITER_SHA = git rev-parse HEAD:3rdparty/aiter (the submodule pointer), hashes the resolved image tag into IMAGE_SLUG, and tries to restore an Actions cache marker for that key. The dispatcher then forwards trigger_aiter_upload = synchronize && aiter changed && cache miss and the cache key down to rocm-ci.yml. The image_tag is also forwarded as docker_image_override so the downstream resolver short-circuits to the same value.

rocm-ci.yml swaps its inline image-selection job for a uses: of the new reusable workflow, makes docker_image_override required, and adds upload_aiter_prebuilt which calls aiter-prebuilt-upload.yml when trigger_aiter_upload is true. build now needs: [select_image, upload_aiter_prebuilt] with if: always() && select_image.success && (upload skipped or success) so test-only PRs don't pay the upload cost but PRs that bump aiter must successfully upload before building wheels. All other jobs are mechanically updated for image-tagimage_tag.

aiter-prebuilt-upload.yml is restructured to be invokable both manually (workflow_dispatch) and from another workflow (workflow_call). Env vars NVTE_AITER_PREBUILT_BASE_URL (from vars) and NVTE_AITER_PREBUILT_UPLOAD_TOKEN (from secrets) are hoisted to job level so the preflight step and the docker exec both see them. The new "Preflight: Artifactory upload reachability" step runs the script in --preflight --upload mode before pulling docker, so an Artifactory or token problem fails fast on the host runner. On success, when called with aiter_upload_cache_key, it writes a small .aiter-upload-success marker and saves it to the Actions cache under that key — this is what lets the dispatcher skip re-uploads for the same (aiter SHA, image) pair on later synchronizes. permissions: actions: write is required for the cache save.

aiter_prebuild_upload.sh previously only had a --build flag. The new preflight branch sits at the top of the script and exits before the build/package logic. _aiter_set_artifactory_check_urls derives ${BASE%%/artifactory/*}/artifactory/api/system/ping and a probe URL ${BASE}/__aiter_repo_access_probe_not_a_real_artifact so HEAD returns 404 on a healthy repo (treated as success alongside 200); any other code is a hard failure. Upload mode requires the bearer token; download mode is anonymous to mirror what CMake's file(DOWNLOAD) actually does.

rocm-wheels-build.yml stops hardcoding the Artifactory URL (now vars.NVTE_AITER_PREBUILT_BASE_URL at job env) and adds a continue-on-error: true download preflight gated on inputs.use_prebuilt_aiter. It's intentionally non-fatal — a transient Artifactory blip shouldn't kill a wheel build that can still fall back to building AITER from source — but it surfaces as a ::warning:: so failures are visible.

Testing. No automated tests were added; this is a CI-only change. Validation comes from observing aiter-prebuilt-upload and rocm-ci-dispatch runs once merged (and from the preflight steps themselves, which act as live smoke tests against Artifactory).

Notes for reviewers.

  • The PR description says the trigger is push to dev with paths: 3rdparty/aiter, but the implemented mechanism is per-PR synchronize-driven, gated by an Actions-cache success marker keyed on (IMAGE_SLUG, AITER_SHA). Worth confirming that's the intended model.
  • Cache scope: actions/cache is keyed per-branch with fallback to the default branch. A success marker written from a feature branch may not be visible from another feature branch, leading to redundant (but presumably idempotent) re-uploads. Worth confirming Artifactory upload of an existing artifact is overwrite-safe.
  • Renaming docker_imagedocker_image_override in aiter-prebuilt-upload.yml is a breaking change for any out-of-tree caller still passing the old name.
  • Renaming output image-tagimage_tag is a breaking change for any external workflow consuming select_image outputs from rocm-ci.yml.
  • permissions: actions: write is added to both aiter-prebuilt-upload.yml and rocm-ci-dispatch.yml to enable the cache writes; this is the minimum needed for actions/cache/save and actions/cache/restore.
  • The probe URL __aiter_repo_access_probe_not_a_real_artifact relies on Artifactory returning 404 (not 403) for missing artifacts in an accessible repo — true for the default anonymous-read config but worth confirming against the specific repo policy.

Generated by Claude. To request a code review, comment /claude review.

- name: Detect PR changes under 3rdparty/aiter
uses: dorny/paths-filter@v4
id: paths
if: github.event.action == 'synchronize'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: this if: is on the step, not the job, so aiter_gate still spins up an ubuntu-latest runner on every labeled and reopened event just to skip its only step. Moving the gate to the job level (and letting the downstream aiter_prebuilt_upload_trigger see an empty aiter_paths output, which the existing expression already treats as falsy) would avoid that.

Suggested change
if: github.event.action == 'synchronize'
- name: Detect PR changes under 3rdparty/aiter
uses: dorny/paths-filter@v4
id: paths
with:
filters: |
aiter:
- '3rdparty/aiter/**'

…paired with if: github.event.action == 'synchronize' on the aiter_gate job itself.


echo "Selected image: $IMAGE_TO_USE"
echo "image-tag=$IMAGE_TO_USE" >> $GITHUB_OUTPUT
uses: ./.github/workflows/select-docker-image.yml
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select-docker-image.yml ends up being invoked three times along this chain (dispatch.yml → rocm-ci.yml here → aiter-prebuilt-upload.yml), once per layer. The two inner invocations always receive the already-resolved tag via docker_image_override, so they only run the reusable workflow's JSON lookup to overwrite it with the override — three runners and three checkouts for the same answer.

If the override is non-empty, you can skip the reusable workflow and emit the override directly as select_image.outputs.image_tag (e.g., a tiny job whose only step echoes it to $GITHUB_OUTPUT), and reserve the full reusable workflow for the workflow_dispatch path. Same for aiter-prebuilt-upload.yml.

_aiter_check_artifactory_download
;;
*)
echo "Usage: $(basename "$0") --preflight --upload | --preflight --download" >&2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usage string omits the script's primary modes (no-arg package-only and --build). When someone misuses --preflight and lands on this branch they'll be told only the preflight forms exist.

Suggested change
echo "Usage: $(basename "$0") --preflight --upload | --preflight --download" >&2
echo "Usage: $(basename "$0") [--build]" >&2
echo " $(basename "$0") --preflight --upload | --preflight --download" >&2

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

Claude review

Reviewed the AITER auto-upload + shared docker-image selector changes (6 CI files, +319 / -79). The flow looks sound: PR-side path filter on 3rdparty/aiter, cache-keyed marker so re-runs skip, build/upload via the existing aiter_prebuild_upload.sh --build, and a centralized select-docker-image.yml reusable workflow.

Three minor inline notes — none blocking:

  • aiter_gate job-level vs step-level if: (runner waste on labeled/reopened).
  • select-docker-image.yml is invoked three times along the dispatch chain when the override is already set.
  • aiter_prebuild_upload.sh --preflight usage string is missing the primary modes.

Worth confirming with the author:

  • On a brand-new PR (no synchronize event), the upload trigger never fires; subsequent label/reopen events see aiter_paths == ''. The first push after open will trigger it, so this is mostly a corner case, but the design assumption is worth a comment in the workflow.
  • dispatch uses always() and only checks select_ci_image.result, so a transient failure in aiter_prebuilt_upload_trigger silently degrades to "no upload". Acceptable as fail-soft, but the build path would then fall back to source-build of AITER without warning.

Copyright headers: OK — all six files (4 modified, 1 added, 1 unchanged-header script) have correct AMD-only headers ending in 2026.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants